Data Selection for IT Texts using Paragraph Vector

Authors

  • Mirela-Stefania Duma
  • Wolfgang Menzel
Abstract

This paper presents an overview of the system submitted by the University of Hamburg to the IT domain shared translation task at the ACL 2016 First Conference on Machine Translation (WMT 2016). We chose data selection as the domain adaptation method. The general-domain data is filtered using paragraph vectors, a novel approach to scoring the sentences. Experiments were conducted for English-German under the constrained condition.
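
The abstract does not say how the paragraph vectors are turned into sentence scores, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes each sentence already has a paragraph vector, scores every general-domain sentence by cosine similarity to the centroid of the in-domain vectors, and keeps the top-scoring ones. The function names (`cosine`, `centroid`, `select_sentences`), the `keep` parameter, and the two-dimensional toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def select_sentences(general, in_domain_vectors, keep):
    """Return the `keep` general-domain sentences closest to the in-domain centroid.

    `general` is a list of (sentence, vector) pairs. Scoring by similarity to
    the centroid is an assumption made for this sketch.
    """
    center = centroid(in_domain_vectors)
    scored = [(cosine(vector, center), sentence) for sentence, vector in general]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored[:keep]]


# Toy, hand-made vectors standing in for learned paragraph vectors.
in_domain_vectors = [[0.9, 0.1], [0.8, 0.2]]
general = [
    ("Click the Save button.", [0.85, 0.15]),
    ("The weather was lovely.", [0.1, 0.9]),
    ("Restart the server.", [0.7, 0.3]),
]
selected = select_sentences(general, in_domain_vectors, keep=2)
print(selected)
```

In a real pipeline the vectors would come from a trained paragraph vector model, and the cutoff would be tuned on held-out in-domain data rather than fixed by hand.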


Similar resources

Generative Paragraph Vector

The recently introduced Paragraph Vector is an efficient method for learning high-quality distributed representations for pieces of text. However, an inherent limitation of Paragraph Vector is its inability to infer distributed representations for texts outside of the training set. To tackle this problem, we introduce a Generative Paragraph Vector, which can be viewed as a probabilistic exten...


An Improvement in Support Vector Machines Algorithm with Imperialism Competitive Algorithm for Text Documents Classification

Due to the exponential growth of electronic texts, organizing and managing them requires tools that return the information users search for in the shortest possible time. Thus, classification methods have become very important in recent years. In natural language processing, and especially in text processing, one of the most basic tasks is automatic text classification. Moreover, text ...


Document Embedding with Paragraph Vectors

Paragraph Vectors has been recently proposed as an unsupervised method for learning distributed representations for pieces of texts. In their work, the authors showed that the method can learn an embedding of movie review texts which can be leveraged for sentiment analysis. That proof of concept, while encouraging, was rather narrow. Here we consider tasks other than sentiment analysis, provide...


Sentiment Analysis with Deeply Learned Distributed Representations of Variable Length Texts

Learning good semantic vector representations for phrases, sentences and paragraphs is a challenging and ongoing area of research in natural language processing and understanding. In this project, we survey and implement several deep-learning and deep-learning-inspired approaches and evaluate these algorithms on several sentiment-labeled datasets and analysis tasks. In doing so, we demonstrate n...


Distributed Representations of Sentences and Documents

Many machine learning algorithms require the input to be represented as a fixed-length feature vector. When it comes to texts, one of the most common fixed-length features is bag-of-words. Despite their popularity, bag-of-words features have two major weaknesses: they lose the ordering of the words and they also ignore semantics of the words. For example, “powerful,” “strong” and “Paris” are eq...




Publication date: 2016